Use of Solr and Xapian in the Invenio document repository software

Authors

  • Patrick O. Glauner
  • Jan Iwaszkiewicz
  • Jean-Yves LeMeur
  • Tibor Simko
Abstract

Invenio is a free, comprehensive, web-based document repository and digital library software suite originally developed at CERN. It can serve a variety of use cases, from an institutional repository or digital library to a web journal. To make full use of full-text documents for efficient search and ranking, Solr was integrated into Invenio through a generic bridge. Solr indexes the extracted full texts and the most relevant metadata, so Invenio can take advantage of Solr's efficient search and word similarity ranking capabilities. In this paper, we first give an overview of Invenio, its capabilities and features. We then present our open source Solr integration as well as the scalability challenges that arose for an Invenio-based multi-million record repository: the CERN Document Server. We also compare our Solr adapter to an alternative Xapian adapter that uses the same generic bridge. Both integrations are distributed with the Invenio package and are ready to be used by institutions using or adopting Invenio.
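
The "generic bridge" idea mentioned in the abstract can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the names `FulltextBackend`, `InMemoryBackend`, `add` and `search` are hypothetical and are not Invenio's actual API, and the in-memory backend ranks by raw term frequency as a stand-in for the word similarity ranking that a real Solr or Xapian adapter would provide.

```python
from abc import ABC, abstractmethod
from collections import Counter


class FulltextBackend(ABC):
    """Hypothetical bridge interface; NOT Invenio's actual API."""

    @abstractmethod
    def add(self, recid, fulltext, metadata):
        """Index the full text and selected metadata of one record."""

    @abstractmethod
    def search(self, query):
        """Return matching record IDs, best match first."""


class InMemoryBackend(FulltextBackend):
    """Stand-in for a Solr or Xapian adapter (assumption for illustration).

    Ranks by raw term frequency instead of a real similarity measure.
    """

    def __init__(self):
        self._docs = {}

    def add(self, recid, fulltext, metadata):
        # Index full text and metadata values together, as plain lowercase tokens.
        text = " ".join([fulltext, *map(str, metadata.values())])
        self._docs[recid] = Counter(text.lower().split())

    def search(self, query):
        terms = query.lower().split()
        hits = []
        for recid, counts in self._docs.items():
            score = sum(counts[t] for t in terms)
            if score > 0:
                hits.append((recid, score))
        # Highest score first; ties broken by record ID for determinism.
        hits.sort(key=lambda h: (-h[1], h[0]))
        return [recid for recid, _ in hits]


backend = InMemoryBackend()
backend.add(1, "solr solr ranking", {"title": "Solr"})
backend.add(2, "xapian ranking", {"title": "Xapian"})
results = backend.search("solr ranking")
```

Because calling code depends only on the abstract interface, a second adapter could be swapped in without changes elsewhere, which is the property that makes a side-by-side comparison of two search engines practical.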

Similar resources

HEPData: a repository for high energy physics data

The Durham High Energy Physics Database (HEPData) has been built up over the past four decades as a unique open-access repository for scattering data from experimental particle physics papers. It comprises data points underlying several thousand publications. Over the last two years, the HEPData software has been completely rewritten using modern computing technologies as an overlay on the Inve...

Comparison of Selected Software Systems for Creation of Digital Libraries from the Field of Open Source for the Needs of the NRGL STL

This document contains detailed characteristics and orientation comparison of software systems used to build digital libraries. It was produced for the National Technical Library (NTK) within the project National Repository of Grey Literature (NRGL). It describes and compares these systems: CDS Invenio, DSpace, Eprints, Fedora and Greenstone. There were selected, more or less intuitively well-k...

Building Digital Collections Using Open Source Digital Repository Software: A Comparative Study

Over the last decade, a great number of digital library and digital repository systems have been developed and published as open-source software. The variety of available software systems is a source of confusion when an organization is planning to build a repository infrastructure to host its collections. To simplify the decision process five widely used open-source repository software systems are c...

A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora

We present software for retrieving and exploring duplicated text passages in low-quality OCR historical text corpora. The system combines NCBI BLAST, software created for comparing and aligning biological sequences, with the Solr search and indexing engine, providing a web interface to easily query and browse the clusters of duplicated texts. We demonstrate the system on a corpus of scanned...

A survey on Automatic Text Summarization

Text summarization endeavors to produce a summary version of a text, while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to decipher through such massive amount of data, in order to extract the useful information, is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...

Journal title:
  • CoRR

Volume: abs/1310.0250

Publication date: 2013